Getting Teacher Assessment Right: 
What Policymakers Can Learn from Research 

Patricia H. Hinchey, Penn State University 



Executive Summary 

It is well established that teacher quality makes a difference in student learning. Since the 
implementation of No Child Left Behind in 2002, staffing every classroom with a high-quality 
teacher has been an official national priority. That goal entails an implicit requirement to assess 
teacher and teaching quality more rigorously than has been the case in the past. Despite decades 
of research on how best to assess teacher performance, however, no consensus has evolved on 
any single assessment strategy or collection of strategies— indicating that the problem of 
designing adequate and appropriate assessment is inherently complex and controversial. Such 
complexity has not, however, prevented the Obama administration from encouraging 
policymakers to define “good” teachers as those who produce gains in student achievement, 
measured by gains in standardized test scores. 

Notwithstanding the federal enthusiasm for test scores, many researchers have warned against 
using a single measurement of any kind as the primary basis for such important personnel 
decisions as teacher retention, dismissal or pay. While there are important questions about what 
achievement scores can— and cannot— indicate about individual teachers, there is no question 
that placing excessive emphasis on test scores alone can have unintended and undesirable 
consequences that undermine the goal of developing an excellent teaching force. 

Given the experience to date with an overwhelming focus on student achievement scores as a 
basis for high-stakes decisions, policymakers would do well to pause and carefully examine the 
issues that make teacher assessment so complex before implementing an assessment plan. To 
facilitate such examination, this brief reviews credible research exploring: the feasibility of 
combining formative assessment (a basis for professional growth) and summative assessment (a 
basis for high-stakes decisions like dismissal); the various tools that might be used to gather 
evidence of teacher effectiveness; and the various stakeholders who might play a role in a 
teacher assessment system. It also offers a brief overview of successful exemplars. 

Based on the research reviewed, it is recommended that policymakers employ an assessment 
system that targets both continual improvement of the teaching staff and timely dismissal of 
teachers who cannot or will not improve. Steps toward that goal include that policymakers: 

• Be clear about the purposes of any assessment before selecting strategies. 
Where formative and summative assessment are to be combined, plan to 
address the challenges of dual-purpose systems. 

• Involve all key stakeholders in system design. 



• Rather than employing a single assessment tool, gather evidence from 
multiple sources. Combine strategies so that the weakness of any single tool 
is offset by the strengths of another. 

• Be sure that the criteria for assessing performance, artifacts or other factors 
are credible and are well understood by teachers and assessors. 

• Provide high-quality, ongoing training for assessors and routinely calibrate 
their efforts to ensure consistent application of criteria. 

• Look to high-quality research on existing tools and programs to inform the 
design of assessment systems. 

• Commit sufficient resources to produce high-quality, productive 
assessment. 






Introduction 

It is well established that teacher quality makes a difference in student learning. 1 Since the 
implementation of No Child Left Behind in 2002, staffing every classroom with a high-quality 
teacher has been an official national priority. That goal entails an implicit requirement to assess 
the quality of teachers and teaching more rigorously than has been the case in the past. 2 Despite 
decades of research on how best to assess teacher performance, however, no consensus has 
evolved on any single assessment strategy or collection of strategies— indicating that the problem 
of designing adequate and appropriate assessment is inherently complex and controversial. Such 
complexity has not prevented the Obama administration from encouraging policymakers to define 
“good” teachers as those who produce gains in student achievement, measured by gains in 
standardized test scores. The administration’s Race to the Top initiative, which offers competitive grant money 
rewarding states that link achievement data to individual teachers, has already prompted some 
states to pass laws mandating that teacher evaluation be tied to student achievement. 3 

Notwithstanding federal enthusiasm for test scores, many researchers have warned against 
using a single measurement of any kind as the primary basis for such important personnel 
decisions as teacher retention, dismissal or pay. 4 While there are important questions about 
what exactly achievement scores can— and cannot— indicate about individual teachers, there is 
no question that placing extreme emphasis on test scores alone can have unintended and 
undesirable consequences that undermine the goal of developing an excellent teaching force. 
NCLB’s emphasis on high-stakes testing, for instance, has led not only to widespread cheating, 
but also to such counterproductive practices as school personnel encouraging academically 
struggling students to transfer or drop out. 5 While such practices might have led to higher 
achievement scores, no one would consider a teacher who promoted cheating or dropping out a 
“good” teacher. 6 The equation of higher test scores with high-quality teachers and teaching 
ignores such complications and their potential for harming students. 

Given past drawbacks with basing high-stakes decisions exclusively on student achievement 
scores, policymakers who are considering employing test scores as the primary tool for teacher 
assessment would do well to pause and carefully examine research evidence. To facilitate such 
examination, this brief explores the research on several key questions: 

• What exactly is to be assessed, and for what purposes? 

• What measurement tools are available, and what are their strengths and weaknesses? 

• Who is to do the assessing? 

• What systemic models has research shown to be viable for credible and comprehensive 
assessment of teachers and teaching quality? 

This research review provides a basis for concluding recommendations for policymakers. 



Methods 

Because of the current interest in using test scores as a basis for teacher dismissal, this brief 
focuses on the assessment of current classroom teachers. It therefore does not include an 
exploration of related research concerning assessment issues in teacher education, initial 
certification, and hiring. 

The studies reviewed here are primarily research articles from peer-reviewed journals, although 
a few credible research reports from other sources are also included. Overall, 275 articles and 
reports were examined as potentially relevant to this review. 7 

What Exactly Is to Be Assessed— And For What Purposes? 

High-stakes assessment can be problematic. That which is assessed is often distorted, and that 
which is not assessed is often neglected. When mandatory testing included only reading and 
math, for example, many schools narrowed the curriculum to those subjects at the expense of 
others, like science. 8 Because priority setting drives behavior, before asking how to assess 
teachers, it is essential to ask what is so important to teacher and teaching quality that it must 
be evaluated. Possibilities go well beyond test scores and range widely, from deep (untested) 
student learning and such teacher traits as honesty to specialized knowledge and skills, such as 
how to adapt learning activities for special-needs students. 

What to assess is not, however, the only preliminary consideration. Just as it is often assumed 
that student achievement is a logical and sufficient way to assess teachers, it is also widely 
assumed that the point of such assessment is to make high-stakes personnel decisions. However, 
the question of how to use assessment data is more complex than it may first appear. Since the 
purpose of assessment also has implications for the choice of assessment tools, a discussion of 
two basic types of assessment— formative and summative— follows the discussion of assessment 
categories below. 

Assessment Categories 

Despite many earlier efforts to develop one, there is no agreed-upon definition of teacher 
quality. Recently, however, several researchers have worked to clarify relevant factors. 9 
Although terminology and specific categorizations vary in the literature, three common 
categories emerge: teacher quality, teacher performance, and teacher effectiveness (see Figure 
1). Teacher quality refers to teacher characteristics such as education, experience, and beliefs. 
Teacher performance refers to what a teacher does, both inside and outside the classroom, and 
includes such elements as classroom interaction with students and collaborative activity with 
parents and others in the school community. Teacher effectiveness refers to teacher influence on 
student learning and includes such elements as student test scores and student motivation. Each 
of these categories has potential for informing judgments about teachers and teaching; each 
appears routinely in research literature, although different researchers may define the same 
term a bit differently. 

Teacher Quality: Personal traits, skills, and understandings 

• Education, experience, credentials, licensure 

• Content and pedagogical knowledge, including the ability to match pedagogy 
to context 

• Understanding of learners and their learning and development, including of 
specific populations like English Language Learners 

• Dispositions, beliefs, expectations, values 

Teacher Performance: Teacher activities 

• Classroom activities and interaction between students and teachers 

• Learning activities provided or mentored outside the classroom 

• Teacher activities outside the classroom, in the school and the community 

Teacher Effectiveness: Teacher effects on students 

• Student achievement 

• Graduation rates 

• Student attitudes, behavior, motivation, social and emotional well-being 

Figure 1. Categories of Teacher Assessment 

Teacher Quality 

Teacher quality can be thought of as those attributes the teacher brings to the classroom, 
including specialized knowledge. Some factors often included in this category (education, 
certification/licensure, and experience, for example) are frequently considered primarily during 
hiring, and so lie beyond the scope of this brief. However, there is widespread recognition that 
other personal qualities of teachers are important. Standards from both the National Council for 
Accreditation of Teacher Education (NCATE) and the Interstate New Teacher Assessment and 
Support Consortium (INTASC), for example, detail expected “dispositions.” 10 Surveying existing 
literature, Thornton (2006) found that dispositions “often loosely equate to values, beliefs, 
attitudes, characteristics, professional behaviors and qualities, ethics and perceptions.” 11 A 
common assumption is that teachers should be reflective, habitually monitoring their 
effectiveness and planning improvements. 12 

In the constellation of teacher characteristics receiving attention, teacher beliefs about students’ 
capacity to learn are a particular concern because they shape a teacher’s classroom choices. 13 
Recent studies have linked achievement gaps with negative teacher beliefs about students of 
color, students from low socioeconomic backgrounds, or both. 14 Changing teachers’ negative 
preconceptions might even change classroom practice and help narrow achievement gaps. 15 

However, research has not as yet established the full complement of teacher characteristics that 
may affect student achievement. As Munoz and Chang (2007) aptly summarize, “Teacher 
characteristics and student growth have an elusive relationship, but practice in the classrooms 
tells us that they are two intertwined concepts.” 16 As these researchers note, policymakers will 
need “to make the best decision based on their particular context” about which teacher 
characteristics might be important to assess. 

Teacher Performance 

Teacher performance can be thought of as those things a teacher does, both inside and outside 
of the classroom. Because specialized knowledge does not automatically translate to effective 
classroom performance, it is necessary to assess not only what a teacher knows but also what a 
teacher can do. Teacher performance thus includes such instructional basics as how well a 
teacher plans learning activities, maintains a positive classroom environment, communicates 
with students, and provides productive feedback. It also includes activities outside the 
classroom, such as advising student groups, taking part in committees and other school-wide 
work, and communicating with parents. 

To assess teacher performance requires having a set of performance criteria. For example, 
elements that Goe, Bell & Little (2008) consider essential include whether teachers “use diverse 
resources to plan and structure engaging learning opportunities; monitor student progress 
formatively, adapting instruction as needed . . . collaborate with other teachers, administrators, 
parents, and education professionals to ensure student success, particularly the success of 
students with special needs and those at high risk for failure.” 17 Kennedy (2008) includes as 
examples of relevant classroom practices “being organized, providing clear goals and standards, 
[and] keeping students on task”; as examples of typical practices outside the classroom, she 
includes “interacting with colleagues and parents, planning a curriculum that engages students, 
providing supervision to the chess club.” 18 

Assessing a teacher’s activities requires specifying clear criteria for desired behaviors. Often 
such criteria reflect the standards of professional organizations; many models are available. 19 To 
allow for variability in the teaching context, some models phrase expectations broadly enough to 
cover a wide range of activities in a wide variety of classroom contexts. 20 A broad goal, for 
example, might be that teachers “clearly state the goal of each class when it begins.” Narrower 
guidelines are available in discipline-specific teaching standards formulated by several 
professional organizations, including those used for accrediting teacher education programs. 21 
A discipline-specific criterion for language arts teachers, for example, might be that they 
“provide students practice in identifying and correcting common grammatical errors in their 
writing.” Because assessment criteria can shape classroom behavior if performance assessment 
is well implemented, policymakers should choose them with great care. 22 

Teacher performance can be assessed. Heneman and colleagues (2006) 23 reviewed several 
studies of four sites implementing a well-known set of criteria (Danielson’s 1996 Framework for 
Teaching) 24 and found that “the scores from standards-based performance evaluation systems 
can have a substantial positive relationship with student achievement and that the instructional 
practices measured by these systems contribute to student learning.” There is also evidence, 
however, that the validity of evaluations varies significantly across evaluators, suggesting the 
importance of providing extensive training for evaluators and monitoring the credibility of their 
judgments. 25 

Teacher Effectiveness 

Teacher effectiveness can be considered the result of a teacher’s activities. It encompasses a wide 
range of outcomes, obviously including student learning. Academic achievement is critical, but 
as noted earlier, defining teacher effectiveness only in those terms ignores several other 
important ways that teachers affect students and the school community. The limitations of 
assessment based on student achievement are amplified when achievement is measured only by 
standardized test scores, with no consideration of such other classroom data as student projects, 
performances, papers, learning logs, and the like. 

The current enthusiasm for using student test scores as the sole measure of teacher effectiveness 
stems from several sources— including convenience. Test score data are readily available because 
of NCLB requirements, and non-statisticians often perceive statistical analyses as objective, 
simple and reliable. Moreover, federal policy attaches high stakes to high scores, forcing school 
personnel to value them highly. 

Also fueling the interest in test scores is the development of value-added modeling, which 
increases the capacity of researchers to isolate the effect of a single teacher from other influences 
on student achievement (such as prior teachers, home influences, school environment and 
student motivation). This modeling is sometimes known as Value-Added Assessment (VAA), 
and it uses complex formulas to estimate students’ likely achievement gains in a given year. 
Actual gains are compared to this estimate, and classroom teachers are credited (or blamed) 
when students experience greater (or lesser) gains than expected. 
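
The logic of comparing actual gains with expected gains can be illustrated with a deliberately simplified sketch. It assumes that expected achievement is predicted from a single prior-year score by one straight-line fit across all students, and that a teacher's estimate is the average gap between actual and expected scores in that teacher's class. The function names, the data layout, and the one-predictor model are all illustrative assumptions; operational VAA models are far more elaborate and are not reproduced here.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("prior scores must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def value_added(records):
    """Toy value-added estimate (illustrative assumption, not a real VAA model).

    records: iterable of (teacher, prior_score, current_score) tuples.
    Returns {teacher: mean of (actual - expected) current scores}.
    """
    records = list(records)
    intercept, slope = fit_line([r[1] for r in records],
                                [r[2] for r in records])
    gaps = defaultdict(list)
    for teacher, prior, current in records:
        expected = intercept + slope * prior  # expected score given prior score
        gaps[teacher].append(current - expected)
    return {teacher: mean(values) for teacher, values in gaps.items()}


# Hypothetical data: two classes with identical prior scores.
sample = [
    ("A", 50, 60), ("A", 60, 70), ("A", 70, 80),
    ("B", 50, 52), ("B", 60, 62), ("B", 70, 72),
]
print(value_added(sample))
```

With these made-up scores, teacher A receives a positive estimate and teacher B a negative one of the same size. Nothing in the calculation says why the two classes differed, which is the limitation that the cautions about VAA address.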

However, while various VAA options exist, none is perfect: 

Trade-offs and risky assumptions are required in every case, so any given model is 
necessarily going to be imperfect. In the context of accountability, expectations for what any 
VAA- based tool can reasonably accomplish should be tempered, and the use of its estimates 
must be judicious. 26 

A steady stream of authoritative statements from the nation’s foremost researchers has 
cautioned against the use of VAA to make high-stakes decisions, 27 both because of remaining 
methodological challenges and because “an overly narrow focus on standardized test scores as 
the most important— and in some cases, only— student outcome measure is not aligned with 
what the field agrees an effective teacher does.” 28 Some researchers suggest including other 
important outcomes, such as whether students persist to graduation and whether they 
demonstrate a positive attitude toward learning, toward themselves, and toward others. Another 
concern is whether students evidence a sense of engaged citizenship. 29 

A disincentive for including these latter types of attitudinal outcomes as a measure of teacher 
effectiveness comes from critics who have complained that the most important purpose of 
schools— to develop students’ academic talents— has been elbowed aside by efforts to enhance 
students’ self-esteem. 30 However, in 2009 the Educational Testing Service (ETS) sponsored a 
survey of existing research on the influence of noncognitive variables that found substantive 
empirical evidence indicating a correlation between achievement and student engagement (a 
category that includes such factors as student values and feelings). 31 That correlation was 
especially strong for reading and math. 

While it is obvious that student learning should be factored into any assessment of teacher 
effectiveness, the overwhelming conclusion of top researchers is that value-added assessment 
alone is an invalid and unwise basis for making high-stakes decisions. Just as teacher 
effectiveness should be combined with teacher quality measures and teacher performance 
measures, any measurement of teacher effectiveness that uses VAA should combine it with 
analyses of other evidence, such as classroom artifacts, student self-reports, parent surveys, and 
other key non-academic outcomes known to correlate with student learning. 

Assessment Purposes: Summative and Formative 

There are two very different purposes for assessment, each critical in its own right. Summative 
assessment is used to make a judgment, often a high-stakes decision: whether to award a 
teacher merit pay, for example, or whether to continue or terminate a teacher’s employment. In 
contrast, formative assessment is used to gain information that can help teachers, even teachers 
who are already proficient, to improve or expand their abilities. Developing an excellent 
teaching force requires not only making good decisions about which teachers enter and remain 
in classrooms, but also finding ways to help teachers improve their skills. 32 

More than two decades ago, James Popham (1988) argued that each type of assessment is 
“splendid” in itself, but that they are “counter-productive when combined.” 33 He summarized 
formative assessment as “fixing” the teacher, and summative assessment as “firing” the teacher, 
noting that “From the perspective of the teacher who is being fixed or fired, that distinction is 
profoundly important.” 34 Assessment to improve practice requires that teachers be open to 
admitting weaknesses, which can happen only in a relatively non-threatening environment. In a 
formative situation, the evaluator functions as an ally, providing help to improve performance. 
When important career decisions are also to be based on the evaluation process, however, the 
environment may seem fraught with risk, especially for a teacher having significant difficulty. 35 
In this case the evaluator functions as a potential enemy able to derail a career, and the 
assessment process may seem hostile. Teachers whose work can be improved but who are 
feeling at risk may understandably be inclined to hide, rather than confront, their problems— 
precluding valuable formative feedback. 

Despite the inherent challenge of combining these assessment functions, a single system is 
frequently expected to serve both purposes, and often a single person (usually the principal) is 
responsible for the assessment. Notwithstanding Popham’s skepticism, one recent study 
suggests that combining the two may be possible. Milanowski (2005) divided new teachers in one district into two 
groups, one that received summative and formative feedback from a single source and one that 
received each type of feedback from a different source. He found “no major differences ... in 
terms of openness to discussion of difficulties, reception and acceptance of performance 
feedback, stress, turnover intentions, actual turnover, or performance improvement.” 36 

Moreover, some systems specifically designed to address both purposes have been successful. 
For example, locally developed systems of Professional Development Plans (PDPs) have shown 
promise for dual-purpose evaluation of experienced, competent teachers. 37 Also, peer assistance 
and review (PAR) strategies have shown promise for combined evaluation of both new and 
veteran teachers. 38 Other dual-purpose systems have also been successful. 39 

While it appears that summative and formative assessment may be successfully combined, 
policymakers should remain aware of the challenges involved in doing so and address them as 
they plan. 



What Measurement Tools Are Available? 

Once what is to be assessed has been determined, policymakers can proceed to consider which 
measurement tools to use. 40 Several can be combined into comprehensive systems that assess 
multiple elements and provide multiple forms of data and judgments. Often, one tool can help 
offset weaknesses in another. For example, value-added assessments offer some information 
about student achievement but no information about what a teacher did to produce greater-
than-expected (or less-than-expected) learning gains. Teacher observations, portfolios, and self-
reports on classroom practice can help illuminate the important question of how gains were 
realized or losses were suffered. Moreover, these additional information sources may document 
high-quality teaching notwithstanding poor VAA results, and vice versa. 

In a recent ambitious synthesis of research on teacher effectiveness, 41 Goe and co-authors 
organized assessment tools into seven categories and provided a useful table summarizing the 
purposes, benefits and drawbacks of each (reproduced in the Appendix below). Categories 
discussed here essentially parallel those of Goe and her colleagues, except that I have collapsed 
principal observation and classroom observation, yielding six (rather than seven) categories: 
classroom observation, instructional artifacts, portfolios, teacher self-reports, student surveys, 
and value-added assessment. 42 

Classroom observation 

Classroom observation has long been a common method of teacher assessment, largely because 
it offers rich detail on a teacher’s actual performance that can be used for both formative and 
summative purposes. 43 Classroom visits often take a class period or its equivalent, and 
procedures may be informal or highly structured to include the use of pre- and post-observation 
conferences. 44 Observers usually record their impressions of classroom events and 
characteristics, but calls for more objective evaluation have led to widespread use of observation 
protocols. Many protocols are available, but while validity has been assessed for some of them, 45 
Goe and colleagues caution that “[t]he degree to which observations can or should be used for 
specific purposes depends upon the instrument, how that instrument was developed, the level of 
training and monitoring raters receive, and the psychometric properties of the instrument.” 46 

Based on their research, Kimball & Milanowski (2009) also caution that 

Providing evaluators with relatively detailed rubrics or rating scales describing generic 
teaching behaviors thought to promote student learning, coupled with initial training in 
applying them, is not enough to ensure that all evaluators’ ratings will be positively related 
to student achievement. 47 

Confounding factors can include whether observers, especially principals, have enough time to 
do a thorough classroom assessment, whether they have sufficient familiarity with the wide 
variety of subjects and grades they must assess, and whether they are adequately trained in the 
use of the instruments. 48 Some worry that protocols may force observers to base judgments on 
overly narrow and prescriptive lists of teacher behaviors. 49 Ongoing training is necessary to 
ensure that observers apply criteria consistently over time. 

Instructional artifacts 

Instructional artifacts include a wide range of classroom-related materials, such as lesson plans, 
assignments, handouts, student work (including class work, homework, projects, and exams), 
scoring rubrics, and pictures of such classroom elements as writing on the board. Like 
observation, artifacts offer authentic evidence from the classroom and provide substantive detail 
on actual classroom activity, but analyzing them is less time-consuming than is classroom 
observation. And, while teachers must spend time selecting artifacts, they need not generate any 
new materials. 

As with observation, protocols are available to guide analysis. Different protocols target different 
criteria, such as how well materials reflect standards or the degree of intellectual challenge. 

Little peer-reviewed research has yet been conducted on artifact analysis as a credible means of 
teacher assessment, and criticisms of it have not yet been resolved. 50 Yet there is some evidence 
of promise for this measurement tool. For example, the National Center for Research on 
Evaluation, Standards, and Student Testing at UCLA has conducted several pilot studies on an 
instrument it has developed, the Instructional Quality Assessment, and found correlations with 
observation assessments, student work, and standardized achievement scores. 51 Researchers at 
the Consortium on Chicago School Research have similarly developed the Intellectual Demand 
Assignment Protocol; one study of this instrument found positive correlations between higher- 
scoring assignments and higher student achievement. 52 Pilot studies of a tool called the “Scoop 
Notebook” have also shown promise. 53 

More research is needed, but artifact analysis may be an informative part of a broader 
assessment system. 

Portfolios 



Portfolios include classroom artifacts, like those listed above, as well as a broader range of 
materials, such as samples from a teacher’s journal or a statement of personal teaching 
philosophy— materials not in evidence in the classroom but nonetheless relevant to the teacher’s 
activities. 

The careful selection of evidence to build a coherent portrait of classroom performance requires 
extensive reflection by the teacher; formative evaluation and ongoing professional development 
are an inherent part of the portfolio approach. Ideally, teachers build portfolios gradually over 
time, so that their growth is evident. As for other tools, it is essential that teachers and assessors 
both have a clear understanding of the criteria by which the portfolio will be judged. 

In a review of contextualized assessment tools (2000), Darling-Hammond and Snyder 
summarize the potential of portfolios: 

As assessment tools, portfolios that are structured around standards of practice are able to 
examine a teacher’s practice both in context and in the light of a common set of expectations 
and benchmarks. By giving assessors access to teachers’ thinking as well as to evidence of 
their behaviors and actions (e.g. through videotapes, lesson plans, assignments, and the 
like), portfolios permit the examination of teacher deliberation, along with the outcomes of 
that deliberation in teacher’s actions and student learning. 54 

Because of the rich potential of portfolios to provide insight into multiple facets of the teacher’s 
performance, they have become increasingly popular. Vermont, Connecticut, Washington state 
and Wisconsin have all adopted portfolio-based teacher assessment systems at some level. 55 

The very complexity of portfolios, however, can make them difficult to assess, and more research 
on their reliability and validity for assessment purposes is necessary before they should play a 
major role in accountability systems. 56 Links between portfolios and achievement have been 
found in the National Board for Professional Teaching Standards (NBPTS) system (discussed 
below), but other studies have not established a connection. 57 

Teacher self-reports 

Teacher self-reports can be extremely valuable because teachers have unique, detailed 
information on such important elements as classroom context and teacher intentions. For 
example, observers can say what a teacher did but may have little understanding of why, an 
important consideration when assessing whether instruction has been effective or whether the 
teacher makes good instructional decisions. Moreover, self-reports can offer insight into the 
findings of other assessment measures, such as achievement scores, and so help identify 
appropriate professional development or other improvements. 

Self-reports can take several forms, including surveys, teaching journals or logs, and interviews. 
These reports may be relatively unstructured or highly structured, and they may explore fairly 
generic topics (such as assessment practices) or very specific ones (such as how a particular math 
concept is taught). Studies on the validity of self-reports have yielded mixed results, and some
researchers are concerned about the normal human tendency to make favorable self-reports. 58 In
addition, there is some evidence that data collected only annually, which is highly dependent on
long-term memory, is less reliable than data collected more frequently, as when teachers report on
a specific day at its end. 59 Another concern is that teachers and others may have different
understandings of the same terms (challenging or successful, for example), confounding results. 60

Questions of validity concerning self-reports preclude using them as a primary basis for high- 
stakes decisions. However, they are relatively inexpensive, can yield detailed information useful 
in both formative and summative assessment, and can promote reflection and professional 
development. Moreover, incorporating teacher self-reports conveys the important message that 
the contextual knowledge of practitioners is respected and valued, and so helps to promote 
stakeholder buy-in. 

Student surveys 

Although some adults may have reservations about the ability of students to assess their 
teachers, there is some persuasive evidence that student surveys, even at the elementary school 
level, can be valid sources of information. A 1995 study based on a review of research on 
elementary students’ teacher ratings found evidence that elementary students are “no more 
vulnerable than others to rating leniency and halo” (extending one positive characteristic into a 
positive global rating). 61 Like teachers, students have unique knowledge of the classroom.

One study involving nearly 1,000 students, 35 teachers and four principals found student ratings
of teachers to be good predictors of student achievement as measured by the district’s
criterion-referenced examinations. 62 The quality of the survey instrument used will, of course,
affect results. In this case, researchers worked with instruments that had demonstrated validity
and reliability in prior research. Another study involving over 400 teachers in 27 K-12 schools
similarly found
that student ratings were both reliable and valid. The researchers concluded that student 
assessments “are not popularity contests” and that students can and do “distinguish between 
merely liking a teacher and recognizing one who enables their learning.” 63 

As with other instruments, even proponents of student ratings don’t recommend using them as 
the sole source of information, but rather as part of a comprehensive assessment system. 

“[H]igh student ratings do not necessarily mean the same thing as good teaching. Perhaps the 
best interpretation is that high student ratings in conjunction with at least several other 
indicators are a good indicator of quality teaching.” 64

Value-added assessment 

As noted above, value-added assessment (VAA) is a means of measuring how much academic 
growth can be linked to a particular teacher. Complex formulas predict the amount of growth for 
students in a given year, and particular teachers are assumed to be responsible for students 
meeting, exceeding, or missing the expected gains. 
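
The logic just described can be sketched in a few lines of code. The example below is only a minimal illustration under stated assumptions, not the formula used by any actual VAA system: it assumes that a student's expected score can be predicted from the prior-year score alone with a simple straight-line fit, and it treats a teacher's "value added" as the average gap between students' actual and predicted scores. The function names and the six student records are hypothetical and invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def fit_line(xs, ys):
    """Least-squares straight-line fit of ys on xs; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples.

    Returns {teacher: average of (actual score - predicted score)}.
    Assumption: the prediction uses the prior-year score only; real models
    add many more controls.
    """
    priors = [prior for _, prior, _ in records]
    currents = [current for _, _, current in records]
    intercept, slope = fit_line(priors, currents)

    gaps = defaultdict(list)
    for teacher, prior, current in records:
        predicted = intercept + slope * prior  # expected score for this student
        gaps[teacher].append(current - predicted)  # above or below expectation
    return {teacher: mean(values) for teacher, values in gaps.items()}


# Hypothetical records, invented for illustration only.
records = [
    ("Teacher A", 50, 58), ("Teacher A", 60, 67), ("Teacher A", 70, 76),
    ("Teacher B", 50, 52), ("Teacher B", 60, 61), ("Teacher B", 70, 70),
]
estimates = value_added(records)
for teacher, estimate in sorted(estimates.items()):
    print(f"{teacher}: {estimate:+.1f} points relative to predicted scores")
```

In this toy data set, Teacher A's students score above the predicted line and Teacher B's students score below it. Everything the sketch leaves out, such as student background, school context, and test quality, is exactly what the criticisms discussed next are about.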

There are many influences on student learning beyond the teacher, however, and early versions 
of VAA were criticized as unable to control adequately for such influences as the socioeconomic
status of students and schools and for the validity of the tests used to measure achievement. 65
Researchers have been working for years on improving VAA formulas, and enough progress has 
been made to allow for distinguishing between particularly strong and particularly weak 
teachers— as long as one accepts the importance of test scores as an outcome. 66 However, no 
perfect formula has yet been devised: 

The myriad factors that influence cognitive growth over extended periods of time, the 
purposeful sorting of families and teachers into schools and classrooms, compensatory 
behavior on the part of families, and the imperfections of tests as measures of knowledge 
complicate efforts to estimate measures of teacher effectiveness, including the overall 
variance of teacher value added and the ranking of teachers by quality of instruction. Even
within-school rankings are subject to biases and the vagaries of sampling variability. Along
with possible distortions of classroom time allocation and teaching methods in an effort to 
increase scores, these problems raise concerns about the use of tests for high stakes 
purposes (p. 534). 67 

An additional limitation of VAA, as noted earlier, is that it provides no information on what a 
teacher may or may not be doing to produce specific scores, and it therefore offers no
information helpful for improving practice. This weakness may, however, be offset by 
complementary observational data. 68 

Like the other tools catalogued here, VAA may provide useful information as part of a broader 
assessment system using multiple sources of data, but is not in itself a reliable method for 
assessing teacher performance. 69 



Who Should Assess? 

For many years, teacher assessment was routinely the responsibility of the principal (or 
assistant principal). However, as suggested earlier, relying on a single administrator for teacher
assessment has proven problematic and has been criticized as well for failing to identify weak 
teachers. 70 While some recent research suggests that principals can be effective assessors, there 
has been growing interest in newer alternatives— fueled not only by weaknesses in the 
traditional approach but by increasing calls for teachers to monitor peer performance, as is 
common in other professions. 71 Several systems have evolved that distribute responsibility for 
high-stakes decisions among multiple stakeholders and that give teachers themselves key roles
in both formative and summative assessment. 

Thus, policymakers designing assessment systems must make choices about who will be involved 
in teacher evaluation. This section provides a brief review of research relevant to that issue. 

Principal Ratings 

Summarizing criticism of traditional evaluation by principals, Calabrese and colleagues (2004) 72
identified several common themes: that principals’ ratings do not adequately identify poor and
marginal teachers; that the process is a time-wasting ritual with little or no effect on personnel
decisions and staff development; and that experienced teachers are often dissatisfied with
evaluators’ skill and feedback as well as their failure to connect assessment to professional
development. Just as it is important to take contextual influences into account when considering
student achievement, however, it is also important to consider how context may influence 
principal ratings. 

In their study involving 80 classroom teachers and eight principals, Calabrese and colleagues 
found that both groups had negative feelings about their district’s traditional “top down” 
assessment process. However, in analyzing subjects’ comments, the researchers identified the 
mandated use of a particular rating instrument and the lack of opportunity to provide detailed 
and useful feedback as the real problem. While teachers often blamed principals for stressful 
and ineffective evaluations, principals saw themselves as victims of a system imposed upon 
them that they had no voice in designing. Thus, this study suggests that the weaknesses often 
ascribed to principals may be linked instead to a poor, externally imposed process. 

Like many of their colleagues throughout the US, [teachers in this study] endure evaluation
systems based on reward or punishment. Teachers endure this process and continue to
develop a deepening resentment toward principals who are systematically forced to
participate. ... Principals ... found themselves caught in an enigma. On the one hand, they
desired the less adversarial formative role; on the other hand, they had no choice, but to 
operate as the summative evaluator. 73 

This study suggests that if the assessment process were more collaborative, principal ratings 
might be more useful and better received. The teachers and principals surveyed professed the 
same goals for assessment (accountability and an effective aid to professional growth), but the
imposition of a rigid structure subverted them. 74 This would suggest that a preliminary concern
with relying on principals for evaluation is the need to ensure good conditions. Principals 
themselves appear to believe that major barriers to effective evaluation include insufficient time, 
tenure, and restrictive rules. 75 The strength of an observational protocol or other data collection 
tool can also affect principals’ ratings, as can the amount and quality of training a principal 
receives (if any) in collecting and interpreting data, the consequences of the evaluation, and 
whether the principal is held accountable for the quality of the evaluations. Another concern is 
whether a principal has the necessary subject-area knowledge for all disciplines. 

Although context is critical, and although some principals may be uncomfortable giving strong 
negative feedback, 76 principals can provide valuable assessments. In a 2008 study, Jacob and
Lefgren found that principals can reliably identify both the weakest and strongest teachers, even 
though they are less able to make fine distinctions in the middle range. 77 In these researchers’ 
view, the “findings provide compelling evidence that good teaching is, at least to some extent, 
observable by those close to the education process, even though it may not be easily captured in 
those variables commonly available to the econometrician.” 78 They also note that principal
observation can mitigate concerns about teachers pursuing improved test scores at the expense
of meaningful learning. 

Research on the correlation between principal ratings and student achievement is mixed, 
however, and it is possible that using value-added measures and principal observation together 
might better predict student achievement than using either alone. 79 

Peer Review 

Historically perceived as women’s work, teaching has suffered from both low status and low pay. 
Yet there has been growing support for recognizing teachers as skilled professionals uniquely
qualified to assess their colleagues, as do professionals in other fields. Calls for a more 
collaborative assessment process with more emphasis on professional development have fueled 
interest in— and often union support for— involving teachers in assessment. 80 There is, however, 
scant empirical research on peer review, which can take many different forms. To offer a sense 
of the possibilities for peer assessment and of the existing research, two of the best-known
programs are briefly described here. 

Peer Assistance and Review (PAR) 

Peer Assistance and Review, widely known as PAR, first appeared in the early 1980s in Toledo, 
Ohio. The design calls for a joint union-administration panel to administer a program in which
experienced, highly skilled teachers serve as mentors and primary assessors for new teachers, 
for veteran teachers having difficulty, or for both. Because the reviewing teachers, often called 
consulting teachers or CTs, are released from their own classrooms, there is substantial cost 
involved for classroom replacements. 

The PAR system depends on a clear system of expectations and all stakeholders having a shared
understanding of them. CTs both help mentees meet standards and assess their progress; their
recommendations to rehire or terminate a teacher carry great weight with the oversight panel
making the ultimate decision.

Credible research on PAR is growing, with preliminary findings showing that, while the cost per 
teacher in a PAR program was $4,000 to $7,000, it also created savings including efficiencies
from higher retention of new teachers and lower arbitration and dismissal costs. 81 In addition, 
stakeholders “felt strongly that PAR not only was a worthwhile investment but that it also saved 
the district money.” 82 

Additional research has focused on implementation of 1999 PAR legislation in California. In one 
district studied for several years, PAR increased dismissals, and “[t]he community of educators 
created by PAR and the PAR panel appears to have proved a more rigorous, evidence-based
check on classroom teaching performance.” 83 The accountability provided by the oversight panel
appears crucial in providing support for the CTs who provide summative evaluations of
colleagues.

Other Exemplars Employing Peer Review

While PAR operates completely within a district, peer review may also be structured through 
external agencies. In Connecticut, for example, the state’s Beginning Educator Support and 
Training (BEST) program requires beginning teachers to submit second-year portfolios, scored
by a cohort of experienced teachers that the state trains and employs as assessors. An in-depth 
study of BEST has found substantive gains in student achievement in mathematics and literacy, 
which appear to be linked to Connecticut’s teacher assessment policies. 84 - 85 

While these programs tend to focus on less experienced teachers, the National Board for 
Professional Teaching Standards (NBPTS) offers national certification to highly skilled teachers 
through another type of peer-review process. Applicants must take subject-matter tests and
submit a detailed portfolio that includes such materials as classroom videotapes, written 
analyses of teacher objectives, and artifacts demonstrating student learning. The portfolios are 
reviewed by teachers accomplished in the same subject and at the same high experience and 
skill level as the candidates. 

Research on the outcomes of NBPTS certification is mixed, 86 but when the National Research 
Council reviewed all available studies and issued a report, it concluded that “national board 
certification distinguishes more effective teachers from less effective teachers with respect to 
student achievement. The differences are small (and not entirely consistent) in absolute terms, 
but when considered in terms of teacher value-added contributions to achievement, they are 
substantively meaningful.” 87 

As is evident from such programs, teachers themselves may play an important role in the 
assessment of their peers. Those designing comprehensive teacher evaluation systems should 
seriously consider including this element. 



Systemic Models 

Because every tool for assessing teacher performance has both strengths and weaknesses, and 
because assessment can have multiple goals, it is better to develop a comprehensive assessment 
system than to adopt a single measure of performance. Several such systems have already been 
developed, and experience suggests they have promise for helping to nurture and promote a 
highly skilled teaching staff. 

Given the recent interest in merit pay, the National Education Association (NEA) recently 
commissioned a review of research literature on linking assessment systems to teacher 
compensation. The review paid particular attention to the impact of specific assessment systems 
on both student achievement and achievement gaps. 88 It identified five programs as “promising 
approaches to improving instruction, raising student achievement, gaining teacher support, 
increasing retention by taking a comprehensive rather than piecemeal approach to reform, and 
centering activities and procedures around instructional improvement and student learning.” 89 
Following is a brief summary of that study’s findings on each program. 90

Three of the models identified as promising have already been discussed here. Danielson’s
Framework for Teaching (FFT) has the longest history and appears most often in the research
literature, and its scores have been found to be positively correlated with value-added measures
of student achievement. PAR research has found major advantages to distributed responsibility 
for personnel decisions and extensive contact between consulting teachers and mentees. 91 
Connecticut’s BEST has also been found to have positive results. 92 

The remaining two promising programs identified in the NEA report are the Teacher 
Advancement Program (TAP) and Denver’s Professional Compensation System (ProComp). TAP 
was designed by Lowell Milken of the Milken Family Foundation. It integrates assessment 
within a system linking accountability to compensation. 93 While some results are promising, the 
program is complex, and more outside research is needed on its effects over time and in more 
varied contexts. 94 The ProComp system, developed collaboratively by the Denver district and 
union leaders, is also tightly linked to compensation. 95 It stresses teachers developing and then 
striving to meet high-quality objectives for student learning, and it financially rewards teachers
for realizing them. Olivia Little, the author of the NEA report, cites as particular advantages of
ProComp its flexibility, choice and varied options, and she finds the model informative in terms 
of fostering collaboration and stakeholder support for a new assessment system. An 
independent assessment of ProComp’s effects on achievement is underway; a preliminary report 
has identified some positive trends in outcomes. 96 

An earlier guide for policymakers by Linda Darling-Hammond and Cynthia Prince (2007) 97 also
noted several promising systems. In addition to BEST, TAP, PAR and ProComp, these 
researchers cite the NBPTS’ national certification as an effective assessment. 

While research indicates promise for these programs, it is important to note that outcomes are 
determined by implementation as well as by design. Different schools or districts may 
implement a program with greater or lesser fidelity to the design and with more or less 
commitment to key components. 



Discussion 

A teacher evaluation system focused solely on high-stakes decisions like tenure or compensation
will not meet contemporary needs. If each classroom is to be staffed with a highly skilled 
teacher, an assessment system must do more than weed out weak teachers. As explained by 
Darling-Hammond and Prince:

Clearly, meeting the expectation that all students will learn to high standards will require a
transformation in the ways in which our education system attracts, prepares, supports, and
develops expert teachers who can teach in more powerful ways.
An aspect of this transformation is developing means to evaluate and recognize teacher
effectiveness throughout the career, for the purposes of licensing, hiring, and granting 
tenure; for providing needed professional development; and for identifying expert teachers 
who can be recognized and rewarded. A goal of such recognition is to keep talented teachers 
in the profession and to identify those who can take on roles as mentors, coaches, and 
teacher leaders who develop curriculum and professional learning opportunities, who 
redesign schools, and who, in some cases, become principals. 98

It is important, then, for policymakers to think clearly about the assessment needs and goals of
their particular context, to make careful decisions among options, and to commit sufficient
resources for successful implementation. 

Since any teacher assessment system must address multiple goals, it should rely on multiple 
sources of information. At the moment, value-added assessment is being strongly promoted as a 
primary indicator of teacher effectiveness. However, policymakers should remember that good
policy requires a sturdier base than momentary popularity. No value-added model provides a
sufficient and reliable indicator of teacher effectiveness. Adding to an already overwhelming
consensus, the Economic Policy Institute recently convened a panel of the nation’s top experts, 
who reached the following conclusion: 

A review of the technical evidence leads us to conclude that, although standardized test 
scores of students are one piece of information for school leaders to use to make judgments 
about teacher effectiveness, such scores should be only part of an overall comprehensive 
evaluation. Some states are now considering plans that would give as much as 50% of the 
weight in teacher evaluation and compensation decisions to scores on existing tests of basic
skills in math and reading. Based on the evidence, we consider this unwise. Any sound
evaluation will necessarily involve a balancing of many factors that provide a more accurate
view of what teachers in fact do in the classroom and how that contributes to student
learning. . . . [T]here is broad agreement among statisticians, psychometricians, and
economists that student test scores alone are not sufficiently reliable and valid indicators of
teacher effectiveness to be used in high stakes personnel decisions, even when the most
sophisticated statistical applications such as value-added modeling are employed. 99

Policymakers interested in reliable teacher assessment must look beyond value-added scores, no 
matter how enticing some claims appear.

Recommendations for Developing a Teacher Assessment System 

Based on the research reviewed, it is recommended that policymakers employ an assessment 
system that targets both continual improvement of the teaching staff and timely dismissal of 
teachers who cannot or will not improve. Steps toward that goal include that policymakers:

• Be clear about the purposes of any assessment before selecting strategies.
Where formative and summative assessment are to be combined, plan to 
address the challenges of dual-purpose systems. 

• Involve all key stakeholders in system design. 

• Rather than employing a single assessment tool, gather evidence from 
multiple sources. Combine strategies so that the weakness of any single tool 
is offset by the strengths of another. 

• Be sure that the criteria for assessing performance, artifacts or other factors 
are credible and are well understood by teachers and assessors. 

• Provide high-quality, ongoing training for assessors and routinely calibrate 
their efforts to ensure consistent application of criteria. 

• Look to high-quality research on existing tools and programs to inform the 
design of assessment systems. 

• Commit sufficient resources to produce high-quality, productive 
assessment.



Appendix: Brief Summaries of Teacher Evaluation Methods



Measure: Classroom Observations

Description: Used to measure observable classroom processes, including specific teacher
practices, holistic aspects of instruction, and interactions between teachers and students. Can
measure broad, overarching aspects of teaching or subject-specific or context-specific aspects
of practice.

Research: Some highly researched protocols have been found to link to student achievement,
though associations are sometimes modest. Research and validity findings are highly dependent
on the instrument used, sampling procedures, and training of raters. There is a lack of research
on observation protocols as used in context for teacher evaluation.

Strengths:

• Provides rich information about classroom behaviors and activities.

• Is generally considered a fair and direct measure by stakeholders.

• Depending on the protocol, can be used in various subjects, grades, and contexts.

• Can provide information useful for both formative and summative purposes.

Cautions:

• Careful attention must be paid to choosing or creating a valid and reliable protocol and
training and calibrating raters.

• Classroom observation is expensive due to cost of observers’ time; intensive training and
calibrating of observers adds to expense but is necessary for validity.

• This method assesses observable classroom behaviors but is not as useful for assessing
beliefs, feelings, intentions, or out-of-classroom activities.


Measure: Principal Evaluation

Description: Is generally based on classroom observation, may be structured or unstructured;
uses and procedures vary widely by district. Is generally used for summative purposes, most
commonly for tenure or dismissal decisions for beginning teachers.

Research: Studies comparing subjective principal ratings to student achievement find mixed
results. Little evidence exists on validity of evaluations as they occur in schools, but evidence
exists that training for principals is limited and rare, which would impair validity of their
evaluations.

Strengths:

• Can represent a useful perspective based on principals’ knowledge of school and context.

• Is generally feasible and can be one useful component in a system used to make summative
judgments and provide formative feedback.

Cautions:

• Evaluation instruments used without proper training or regard for their intended purpose
will impair validity.

• Principals may not be qualified to evaluate teachers on measures highly specialized for
certain subjects or contexts.


Measure: Instructional Artifact

Description: Structured protocols used to analyze classroom artifacts in order to determine the
quality of instruction in a classroom. May include lesson plans, teacher assignments,
assessments, scoring rubrics, and student work.

Research: Pilot research has linked artifact ratings to observed measures of practice, quality of
student work, and student achievement gains. More work is needed to establish scoring
reliability and determine the ideal amount of work to sample. Lack of research exists on use of
structured artifact analysis in practice.

Strengths:

• Can be a useful measure of instructional quality if a validated protocol is used, if raters are
well-trained for reliability, and if assignments show sufficient variation in quality.

• Is practical and feasible because artifacts have already been created for the classroom.

Cautions:

• More validity and reliability research is needed.

• Training knowledgeable scorers can be costly but is necessary to ensure validity.

• This method may be a promising middle ground in terms of feasibility and validity between
full observation and less direct measures such as self-report.


Measure: Portfolio

Description: Used to document a large range of teaching behaviors and responsibilities. Has
been used widely in teacher education programs and in states for assessing the performance of
teacher candidates and beginning teachers.

Research: Research on validity and reliability is ongoing, and concerns have been raised about
consistency/stability in scoring. There is a lack of research linking portfolios to student
achievement. Some studies have linked NBPTS certification (which includes a portfolio) to
student achievement, but other studies have found no relationship.

Strengths:

• Is comprehensive and can measure aspects of teaching that are not readily observable in the
classroom.

• Can be used with teachers of all fields.

• Provides a high level of credibility among stakeholders.

• Is a good tool for teacher reflection and improvement.

Cautions:

• This method is time-consuming on the part of teachers and scorers; scorers should have
content knowledge of the portfolios.

• The stability of scores may not be high enough to use for high-stakes assessment.

• Portfolios are difficult to standardize (compare across teachers or schools).

• Portfolios represent teachers’ exemplary work but may not reflect everyday classroom
activities.


Measure: Teacher Self-Report Measure

Description: Teacher reports of what they are doing in classrooms. May be assessed through
surveys, instructional logs, and interviews. Can vary widely in focus and level of detail.

Research: Studies on the validity of teacher self-report measures present mixed results. Highly
detailed measures of practice may be better able to capture actual teaching practices but may be
harder to establish reliability or may result in very narrowly focused measures.

Strengths:

• Can measure unobservable factors that may affect teaching, such as knowledge, intentions,
expectations, and beliefs.

• Provides the unique perspective of the teacher.

• Is very feasible and cost-efficient; can collect large amounts of information at once.

Cautions:

• Reliability and validity of self-report are not fully established and depend on instrument used.

• Using or creating a well-developed and validated instrument will decrease cost-efficiency but
will increase accuracy of findings.

• This method should not be used as a sole or primary measure in teacher evaluation.



Student Survey

Description: Used to gather student opinions or judgments about teaching practice as part of teacher evaluation and to provide information about teaching as it is perceived by students.

Research: Several studies have shown that student ratings of teachers can be useful in providing information about teaching; may be as valid as judgments made by college students and other groups; and, in some cases, may correlate with measures of student achievement. Validity is dependent on the instrument used and its administration and is generally recommended for formative use only.

Strengths:
• Provides perspective of students who have the most experience with teachers.
• Can provide formative information to help teachers improve practice in a way that will connect with students.
• Makes use of students, who may be as capable as adult raters at providing accurate ratings.

Cautions:
• Student ratings have not been validated for use in summative assessment and should not be used as a sole or primary measure of teacher evaluation.
• Students cannot provide information on aspects of teaching such as a teacher’s content knowledge, curriculum fulfillment, and professional activities.
Measure Description Research Strengths Cautions



Value-Added Model

Description: Used to determine teachers’ contributions to students’ test score gains. May also be used as a research tool (e.g., determining the distribution of “effective” teachers by student or school characteristics).

Research: Little is known about the validity of value-added scores for identifying effective teaching, though research using value-added models does suggest that teachers differ markedly in their contributions to students’ test score gains. However, correlating value-added scores with teacher qualifications, characteristics, or practices has yielded mixed results and few significant findings. Thus, it is obvious that teachers vary in effectiveness, but the reasons for this are not known.

Strengths:
• Provides a way to evaluate teachers’ contribution to student learning, which most measures do not.
• Requires no classroom visits because linked student/teacher data can be analyzed at a distance.
• Entails little burden at the classroom or school level because most data is already collected for NCLB purposes.
• May be useful for identifying outstanding teachers whose classrooms can serve as “learning labs” as well as struggling teachers in need of support.

Cautions:
• Models are not able to sort out teacher effects from classroom effects.
• Vertical test alignment is assumed (i.e., tests essentially measure the same thing from grade to grade).
• Value-added scores are not useful for formative purposes because teachers learn nothing about how their practices contributed to (or impeded) student learning.
• Value-added measures are controversial because they measure only teachers’ contributions to student achievement gains on standardized tests.



Source: Goe, L., Bell, C., & Little, O. (2008). Approaches to evaluating teacher effectiveness: A research synthesis. Washington, D.C.: National Comprehensive Center for Teacher Quality, pp. 16-19. Reproduced by permission of Laura Goe.